Explore and summarize the white wine quality dataset

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.0             0.27        0.36           20.7     0.045
## 2 2           6.3             0.30        0.34            1.6     0.049
## 3 3           8.1             0.28        0.40            6.9     0.050
## 4 4           7.2             0.23        0.32            8.5     0.058
## 5 5           7.2             0.23        0.32            8.5     0.058
## 6 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6
## [1] 4898   13

White wine quality dataset

The white wine quality dataset consists of data from 4898 samples of the
Portuguese Vinho Verde white wine. There are 11 input variables, based on
physicochemical tests, and one output variable, “quality”, based on grades given
by wine experts.

Research questions

The goal of this investigation is to find out whether there are related
variables in this dataset and which of the physicochemical variables have a
relationship with the dependent variable quality.

Being only a wine drinker and not a wine expert, I decided to read more about
the chemical composition of white wine first in order to understand what kind
of data is present in the dataset. Please refer to the readme-document for a
list of sources.

Acids greatly contribute to the taste of wine (source: Waterhouse Lab). However,
volatile acid is undesirable and should be below 1.2 g/dm3 (source: Winefolly).
pH is a measure of active acidity: the lower the pH, the higher the acidity
and vice versa.

Furthermore, for each type of wine there is an optimal range of alcohol
percentage; for Vinho Verde this range is 8% to 11.5% (source: Wikipedia).
The same is more or less true for residual sugar level: sugar is also related
to the type of wine, so it would be interesting to see if there is an “optimal”
level of residual sugar for good quality Vinho Verde wines.

Although some sulfites are produced by the alcohol fermentation process,
sulfur dioxide (SO2) is usually added to wine as a preservative
(source: Winobrothers).
The total amount of sulfur dioxide is the sum of the amount of free sulfur
dioxide and bound sulfur dioxide. In the dataset, only the total and free
amounts of sulfur dioxide are present. In the description of the dataset, it is
stated that concentrations of free SO2 of 50 ppm and higher become evident in
nose and taste. Therefore, it would be interesting to see if higher values
(> 50 ppm) for free SO2 result in lower quality ratings. In the dataset, there
is another related variable, “sulphates”. This is potassium sulphate, a wine
additive which can contribute to sulfur dioxide gas (SO2) levels.

Although I will try to explore all relations between the variables in the
dataset, this initial reading did give me some more specific questions to
focus on as well. In summary:

There are two observations about the dataset that are important if we were to
build a model:

Univariate Plots Section

Although the data description states that there are no missing values, I first
checked whether this was true. Indeed, no missing values were found.

##                    X        fixed.acidity     volatile.acidity 
##                    0                    0                    0 
##          citric.acid       residual.sugar            chlorides 
##                    0                    0                    0 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##                    0                    0                    0 
##                   pH            sulphates              alcohol 
##                    0                    0                    0 
##              quality 
##                    0
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
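The check described above can be sketched as follows. This is only a minimal illustration: `ww_head` is a hypothetical stand-in built from three columns of the six rows printed at the top of this report, whereas the real check ran on the full data frame.

```r
# Hypothetical stand-in: three columns of the six rows printed at the top of the report.
ww_head <- data.frame(
  fixed.acidity = c(7.0, 6.3, 8.1, 7.2, 7.2, 8.1),
  alcohol       = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1),
  quality       = c(6L, 6L, 6L, 6L, 6L, 6L)
)

# Count missing values per column; zero everywhere means there are no NAs.
na_counts <- colSums(is.na(ww_head))
print(na_counts)

# Inspect the column types and the first values of each column.
str(ww_head)
```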

After inspecting the structure and the first six rows of the dataset, there are
three changes I would like to make:

  1. Leave out the column “X” which contains the row numbers and serves just as
    an identifier. This column is not useful for the analysis.

  2. Although the variable quality is expressed as an integer, it is in fact a
    factor variable. Therefore, I want to transform this variable to a factor
    variable. I decide to keep the integer version of variable quality as well.

  3. Seven levels for the factor variable quality is quite a high number, so I
    create an additional factor variable quality.bin that groups quality ratings
    together: ratings 3 and 4 are poor quality, ratings 5, 6 and 7 are average and
    ratings 8 and 9 are good quality wines. By using this variable, I might be able
    to see clearer patterns than with seven levels.
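The three changes listed above could be carried out roughly as sketched here. The toy rows are hypothetical (one wine per rating), and the use of `cut()` for the grouping is an assumption about how the bins might be built, not necessarily how it was done for this report.

```r
# Hypothetical toy data frame with the identifier column X and integer ratings.
ww <- data.frame(
  X       = 1:7,
  alcohol = c(9.0, 9.4, 9.8, 10.5, 11.2, 12.0, 12.5),
  quality = 3:9
)

# 1. Drop the row-number column.
ww$X <- NULL

# 2. Keep the integer rating and add a factor version next to it.
ww$quality.f <- factor(ww$quality, levels = 3:9)

# 3. Group the ratings: 3-4 poor, 5-7 average, 8-9 good.
#    cut() uses right-closed intervals by default: (2,4], (4,7], (7,9].
ww$quality.bin <- cut(
  ww$quality,
  breaks = c(2, 4, 7, 9),
  labels = c("poor", "average", "good")
)

print(table(ww$quality.bin))
```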

First, I want to get a feel for the distribution of values in the dataset.
To do this, I run a summary on all variables.

##  fixed.acidity    volatile.acidity  citric.acid           pH       
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   :2.720  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.:3.090  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median :3.180  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   :3.188  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.:3.280  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :3.820  
##                                                                    
##     alcohol      residual.sugar      density         chlorides      
##  Min.   : 8.00   Min.   : 0.600   Min.   :0.9871   Min.   :0.00900  
##  1st Qu.: 9.50   1st Qu.: 1.700   1st Qu.:0.9917   1st Qu.:0.03600  
##  Median :10.40   Median : 5.200   Median :0.9937   Median :0.04300  
##  Mean   :10.51   Mean   : 6.391   Mean   :0.9940   Mean   :0.04577  
##  3rd Qu.:11.40   3rd Qu.: 9.900   3rd Qu.:0.9961   3rd Qu.:0.05000  
##  Max.   :14.20   Max.   :65.800   Max.   :1.0390   Max.   :0.34600  
##                                                                     
##  free.sulfur.dioxide total.sulfur.dioxide   sulphates         quality     
##  Min.   :  2.00      Min.   :  9.0        Min.   :0.2200   Min.   :3.000  
##  1st Qu.: 23.00      1st Qu.:108.0        1st Qu.:0.4100   1st Qu.:5.000  
##  Median : 34.00      Median :134.0        Median :0.4700   Median :6.000  
##  Mean   : 35.31      Mean   :138.4        Mean   :0.4898   Mean   :5.878  
##  3rd Qu.: 46.00      3rd Qu.:167.0        3rd Qu.:0.5500   3rd Qu.:6.000  
##  Max.   :289.00      Max.   :440.0        Max.   :1.0800   Max.   :9.000  
##                                                                           
##  quality.f  quality.bin  
##  3:  20    poor   : 183  
##  4: 163    average:4535  
##  5:1457    good   : 180  
##  6:2198                  
##  7: 880                  
##  8: 175                  
##  9:   5

Two things stand out when looking at the summaries:

Plotting histograms of all the variables confirms the observations made from
the variable summaries: there are some extreme outliers (sometimes not even
visible in these small plots, but recognizable from the stretched-out x-axes).
The distribution for sulphates seems to be bimodal. Apart from the extreme
outliers, the distributions for most variables seem more or less normal.

In order to reduce skewness of the distributions, I try to use a log scale for
each of the variables. Unfortunately, for most of the variables the skewness
remains. Instead I try to set sensible binwidths and limits to the x-axis.

After some experimenting, I feel that these plots capture the distributions of
the bulk of the values best. I first tried to leave out the highest 5% of
observations but the number of rows dropped would be too high in my opinion.
When taking the lowest 97.5%, the extreme outliers are ignored in the plot
which leads to cleaner pictures. I will use these settings in the
multivariate analyses as well.
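The idea of limiting a plot to the lowest 97.5% of values can be sketched in base R as below. The values in `x` are simulated and purely hypothetical; they only mimic a right-skewed variable with a few extreme values. The histogram is computed without drawing it, so the snippet runs in any session.

```r
# Hypothetical right-skewed values standing in for a variable such as chlorides.
set.seed(1)
x <- c(rlnorm(990, meanlog = -3.1, sdlog = 0.3), runif(10, 0.2, 0.35))

# Upper limit at the 97.5th percentile, so the top 2.5% is left out of the plot.
upper <- quantile(x, probs = 0.975)
x_shown <- x[x <= upper]

# Histogram of the retained values only; the data themselves are not modified.
h <- hist(x_shown, breaks = 40, plot = FALSE)
cat("upper limit:", round(upper, 4),
    " values shown:", length(x_shown), "of", length(x), "\n")
```

Filtering for the plot only, rather than deleting rows, keeps the outliers available for the correlation calculations later on.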

There are two factor variables in the dataset: quality.f (a variable that
contains all original quality ratings from 3 to 9) and quality.bin (a variable
that contains combined levels from the original variable). Both variables will
be used during the analysis. Using the bins can sometimes reveal a bigger
picture or trend that cannot be seen when all levels are present; the original
levels, on the other hand, show more detailed trends than the combined levels.

##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5
##    poor average    good 
##     183    4535     180

The bar charts for both variables show that binning the ratings does not solve
the problem of the highly imbalanced dataset. Combining the levels does,
however, create larger groups for poor and good quality wines, which allows us
to make statements about these two groups (the groups of 20 samples for quality
rating 3 and 5 samples for quality rating 9 were too small for that).

Univariate Analysis Conclusion

The most important finding in the univariate analysis is that there are extreme
outliers for almost all physicochemical variables. At this point in the
analysis, it is not yet clear if there is a pattern in the extreme outliers
(for example: are extreme outliers related to low or just high quality?),
but it will certainly be interesting to do some analyses with and without
outliers.

Bivariate Plots Section

A few observations from the boxplots:

Therefore, I create the boxplots again but use the quality bins instead, which
leads to slightly larger groups for poor (rating 3 and 4) and good (ratings 8
and 9) quality wines.

Please note that I did not limit or transform the variables. The boxplots
were meant to show where the outliers are for each variable.

The boxplots created with the quality bins show some more information:

Next, I will run some ggpairs plots to see some possible relations between
variables in one glance. Unfortunately, including all variables in the plot
results in a plot that is unreadable, so I decided to group variables.

As with the boxplots, I did not limit or transform the variables in the plots
below. Since I have no reason to believe that the outliers are bad measurements,
they should be included when calculating and visualizing relationships. The
outliers are removed only in the multivariate plots to make the plots more
interpretable.

##   quality.bin      mean
## 1        poor 0.3759836
## 2     average 0.2743076
## 3        good 0.2779722
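A per-group mean table like the one above can be produced with `aggregate()`. The sketch below uses a handful of hypothetical toy values, so its numbers differ from the table computed on the full dataset.

```r
# Hypothetical toy values; the table in the report comes from the full dataset.
toy <- data.frame(
  quality.bin = factor(
    c("poor", "poor", "average", "average", "average", "good", "good"),
    levels = c("poor", "average", "good")
  ),
  volatile.acidity = c(0.40, 0.36, 0.27, 0.30, 0.25, 0.28, 0.26)
)

# Mean volatile acidity per quality bin, one row per factor level.
group_means <- aggregate(volatile.acidity ~ quality.bin, data = toy, FUN = mean)
names(group_means)[2] <- "mean"
print(group_means)
```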

Observations:

Observations:

Observations:

To see all relations in one table, I create a correlation matrix:

##                      fixed.acidity volatile.acidity citric.acid    pH
## fixed.acidity                 1.00            -0.02        0.29 -0.43
## volatile.acidity             -0.02             1.00       -0.15 -0.03
## citric.acid                   0.29            -0.15        1.00 -0.16
## pH                           -0.43            -0.03       -0.16  1.00
## alcohol                      -0.12             0.07       -0.08  0.12
## residual.sugar                0.09             0.06        0.09 -0.19
## density                       0.27             0.03        0.15 -0.09
## chlorides                     0.02             0.07        0.11 -0.09
## free.sulfur.dioxide          -0.05            -0.10        0.09  0.00
## total.sulfur.dioxide          0.09             0.09        0.12  0.00
## sulphates                    -0.02            -0.04        0.06  0.16
## quality                      -0.11            -0.19       -0.01  0.10
##                      alcohol residual.sugar density chlorides
## fixed.acidity          -0.12           0.09    0.27      0.02
## volatile.acidity        0.07           0.06    0.03      0.07
## citric.acid            -0.08           0.09    0.15      0.11
## pH                      0.12          -0.19   -0.09     -0.09
## alcohol                 1.00          -0.45   -0.78     -0.36
## residual.sugar         -0.45           1.00    0.84      0.09
## density                -0.78           0.84    1.00      0.26
## chlorides              -0.36           0.09    0.26      1.00
## free.sulfur.dioxide    -0.25           0.30    0.29      0.10
## total.sulfur.dioxide   -0.45           0.40    0.53      0.20
## sulphates              -0.02          -0.03    0.07      0.02
## quality                 0.44          -0.10   -0.31     -0.21
##                      free.sulfur.dioxide total.sulfur.dioxide sulphates
## fixed.acidity                      -0.05                 0.09     -0.02
## volatile.acidity                   -0.10                 0.09     -0.04
## citric.acid                         0.09                 0.12      0.06
## pH                                  0.00                 0.00      0.16
## alcohol                            -0.25                -0.45     -0.02
## residual.sugar                      0.30                 0.40     -0.03
## density                             0.29                 0.53      0.07
## chlorides                           0.10                 0.20      0.02
## free.sulfur.dioxide                 1.00                 0.62      0.06
## total.sulfur.dioxide                0.62                 1.00      0.13
## sulphates                           0.06                 0.13      1.00
## quality                             0.01                -0.17      0.05
##                      quality
## fixed.acidity          -0.11
## volatile.acidity       -0.19
## citric.acid            -0.01
## pH                      0.10
## alcohol                 0.44
## residual.sugar         -0.10
## density                -0.31
## chlorides              -0.21
## free.sulfur.dioxide     0.01
## total.sulfur.dioxide   -0.17
## sulphates               0.05
## quality                 1.00

In this matrix, I can see the following moderate (0.40-0.59), strong (0.60-0.79)
and very strong (0.80 and above) correlations:
* fixed acidity and pH (-0.43)
* alcohol and residual sugar (-0.45)
* alcohol and density (-0.78)
* alcohol and total sulfur dioxide (-0.45)
* density and residual sugar (0.84)
* density and total sulfur dioxide (0.53)
* residual sugar and total sulfur dioxide (0.40)
* free and total sulfur dioxide (0.62)
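Picking such pairs out of a correlation matrix by eye is error-prone, so a small filter can help. The sketch below is one possible way to do it; the data are simulated and hypothetical, and the threshold of 0.4 simply mirrors the lower bound used in the list above.

```r
# Hypothetical toy data with one strongly related pair and one unrelated variable.
set.seed(42)
n <- 200
sugar   <- runif(n, 1, 20)
density <- 0.99 + 0.0004 * sugar + rnorm(n, sd = 0.0005)
pH      <- rnorm(n, mean = 3.2, sd = 0.15)
toy <- data.frame(residual.sugar = sugar, density = density, pH = pH)

# Pearson correlation matrix, rounded to two decimals as in the table above.
cm <- round(cor(toy), 2)

# Keep each pair once (upper triangle) and list those with |r| >= 0.4.
idx <- which(abs(cm) >= 0.4 & upper.tri(cm), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
print(strong_pairs)
```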

In a multivariate analysis, it will be especially interesting to explore two
variables that are not interrelated like alcohol or residual sugar with
total sulfur dioxide, grouped by quality rating.

To find correlation between the (continuous) physicochemical variables and the
(ordinal) variable quality, I use Spearman’s correlation:

## 
##  Spearman's rank correlation rho
## 
## data:  ww$fixed.acidity and ww$quality
## S = 2.1239e+10, p-value = 3.183e-09
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##         rho 
## -0.08448545
## 
##  Spearman's rank correlation rho
## 
## data:  ww$volatile.acidity and ww$quality
## S = 2.3434e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.1965617
## 
##  Spearman's rank correlation rho
## 
## data:  ww$citric.acid and ww$quality
## S = 1.9225e+10, p-value = 0.1996
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## 0.01833273
## 
##  Spearman's rank correlation rho
## 
## data:  ww$pH and ww$quality
## S = 1.7442e+10, p-value = 1.656e-14
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.1093621
## 
##  Spearman's rank correlation rho
## 
## data:  ww$alcohol and ww$quality
## S = 1.096e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.4403692
## 
##  Spearman's rank correlation rho
## 
## data:  ww$residual.sugar and ww$quality
## S = 2.1191e+10, p-value = 8.822e-09
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##         rho 
## -0.08206979
## 
##  Spearman's rank correlation rho
## 
## data:  ww$density and ww$quality
## S = 2.6406e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## -0.348351
## 
##  Spearman's rank correlation rho
## 
## data:  ww$chlorides and ww$quality
## S = 2.5743e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.3144885
## 
##  Spearman's rank correlation rho
## 
## data:  ww$free.sulfur.dioxide and ww$quality
## S = 1.912e+10, p-value = 0.09703
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## 0.02371338
## 
##  Spearman's rank correlation rho
## 
## data:  ww$total.sulfur.dioxide and ww$quality
## S = 2.3436e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.1966803
## 
##  Spearman's rank correlation rho
## 
## data:  ww$sulphates and ww$quality
## S = 1.8932e+10, p-value = 0.01971
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## 0.03331897

For all physicochemical variables except citric acid and free sulfur dioxide,
we can reject the null hypothesis that there is no association between the
variable and the quality rating. However, none of those correlations is strong:
there is a moderate positive relation between alcohol and quality (rho = 0.44),
and there are weak negative relations between density and quality (rho = -0.35)
and between chlorides and quality (rho = -0.31).
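For reference, a single Spearman test of this kind can be sketched as follows. The data are simulated and hypothetical; only the call pattern (`cor.test()` with `method = "spearman"`) corresponds to the output shown above. Setting `exact = FALSE` is my own choice here, because an integer rating produces many ties and exact p-values cannot be computed with ties.

```r
# Hypothetical toy data: an ordinal rating loosely driven by alcohol.
set.seed(7)
alcohol <- runif(300, 8, 14)
quality <- as.integer(cut(alcohol + rnorm(300, sd = 2),
                          breaks = c(-Inf, 9, 10, 11, 12, 13, Inf))) + 2L

# Spearman's rank correlation between the continuous and the ordinal variable.
res <- cor.test(alcohol, quality, method = "spearman", exact = FALSE)
cat("rho =", round(unname(res$estimate), 3),
    " p-value =", signif(res$p.value, 3), "\n")

# Can the null hypothesis of no association be rejected at the 5% level?
print(res$p.value < 0.05)
```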

## 
##  Spearman's rank correlation rho
## 
## data:  ca90$citric.acid and ca90$quality
## S = 1.2307e+10, p-value = 8.783e-10
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## 0.09294089

To get a feeling for the impact of the extreme outliers on the strength of
the relationship between variables, I decided to pick one and visualize the
linear relationship with and without outliers. Of course, a scatterplot is not
the best choice for visualizing a categorical variable but it was just to get an
idea.

As it turns out, it is possible to turn a non-significant relationship into a
significant one by leaving out outliers (see the relation between citric acid
and quality above).

However, doing this would only be legitimate if the outliers in the dataset
are an indication of faulty data but I have no reason to believe the outliers
are coming from bad data. Wine making is a natural process that apparently
sometimes results in more extreme values for the chemical properties.
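The comparison with and without outliers can be sketched like this. Everything here is hypothetical: the data are simulated, and the trimming rule (keeping rows at or below the 90th percentile of citric acid) is only an assumption about what a subset such as `ca90` might contain.

```r
# Hypothetical toy data: a weak trend in the bulk plus a few extreme values.
set.seed(3)
citric  <- c(runif(190, 0.2, 0.5), runif(10, 1.0, 1.6))
quality <- c(round(5 + 3 * (citric[1:190] - 0.2) + rnorm(190, sd = 0.8)),
             sample(3:5, 10, replace = TRUE))
toy <- data.frame(citric.acid = citric, quality = quality)

# Assumed trimming rule: keep rows at or below the 90th percentile of citric acid.
trimmed <- toy[toy$citric.acid <= quantile(toy$citric.acid, 0.90), ]

# Same test on the full and on the trimmed data.
full_test    <- cor.test(toy$citric.acid, toy$quality,
                         method = "spearman", exact = FALSE)
trimmed_test <- cor.test(trimmed$citric.acid, trimmed$quality,
                         method = "spearman", exact = FALSE)

cat("all rows: rho =", round(unname(full_test$estimate), 3), "\n")
cat("trimmed:  rho =", round(unname(trimmed_test$estimate), 3), "\n")
```

Reporting both results side by side, rather than only the trimmed one, makes it visible how much a conclusion depends on a handful of extreme observations.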

Bivariate Analysis

Summary of the bivariate analysis:

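When looking at relations between the chemical variables in the dataset, we can
see that the strongest relationships are those between density and alcohol
(negative, -0.78) and between density and residual sugar (positive, 0.84). Both
alcohol and residual sugar largely determine the density of wine. Free and total
sulfur dioxide are also strongly correlated (0.62), which is expected because
free sulfur dioxide is part of the total.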

Other relationships between chemical values are only moderately strong, such as
the relations between alcohol and total sulfur dioxide, density and total sulfur
dioxide and residual sugar and total sulfur dioxide.

For all physicochemical variables except citric acid and free sulfur dioxide,
we can reject the null hypothesis that there is no association between the
variable and the quality rating. However, none of those correlations is strong:
there is a moderate positive relation between alcohol and quality, and there
are weak negative relations between density and quality and between chlorides
and quality.

Multivariate Plots Section

In this section, I will explore the relationships found in the previous
section by adding a third variable.

The plot above shows the relation between alcohol and total sulfur dioxide. The
shape of the cloud matches the moderately strong negative correlation (-0.45).
Also, the relation between alcohol and quality is visible: the blue spots for
higher quality tend to be on the right side of the plot (higher alcohol
percentage).
However, the blue spots are spread out quite evenly over the vertical axis,
which means that there is not a strong relation between quality and total
sulfur dioxide.

For all plots in this section I leave out the highest 1% of values to improve
the readability of the plot.

This plot does show a relation between total sulfur dioxide and residual sugar
(the shape of the cloud moves in a slightly upward direction for the higher
values of both sulfur dioxide and residual sugar), but there does not seem to
be a relation between quality and either chemical variable. The blue and
red/orange spots for poor and good quality wines seem to be spread quite evenly
along both the x- and y-axis.

Density and residual sugar have the strongest correlation and it shows in the
plot. Also it is clear that higher quality wines tend to be on the lower side
of the cloud, meaning that high quality wines tend to have a lower density.
This was confirmed by the (rather weak) correlation coefficient of -0.35 for
density and quality.

Running the same plot with the quality bins confirms the picture from the
previous plot: the good quality wines (green dots) tend to have a lower density,
whereas the poor quality wines (orange dots) tend to have a higher density.

Alcohol and density are negatively correlated (meaning: the higher the alcohol
percentage, the lower the density) but the same relation as found before is also
visible in this plot: higher quality wines tend to have higher alcohol
percentages.

Linear model

As we are predicting the outcome of an ordinal categorical variable, a linear
model is not appropriate. Instead, ordinal logistic regression would be a good
choice for building a model for this dataset. However, I don’t have
experience with this regression method and therefore I am not certain that I
would interpret the results correctly. I will create such a model at a later
stage.
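As a rough idea of what such a model could look like, here is a minimal sketch using `polr()` from the MASS package, which fits a proportional-odds logistic regression. It is not part of the analysis in this report: the data are simulated and hypothetical, and the single predictor is chosen only for illustration.

```r
library(MASS)  # provides polr() for proportional-odds logistic regression

# Hypothetical simulated data: a three-level ordered rating driven by alcohol.
set.seed(11)
n <- 300
alcohol <- runif(n, 8, 14)
latent  <- alcohol + rlogis(n)
rating  <- cut(latent, breaks = c(-Inf, 10, 12, Inf),
               labels = c("poor", "average", "good"), ordered_result = TRUE)
sim <- data.frame(alcohol = alcohol, rating = rating)

# Fit the ordinal model; Hess = TRUE stores the Hessian so that summary()
# can report standard errors.
fit <- polr(rating ~ alcohol, data = sim, Hess = TRUE)
print(coef(fit))

# Predicted class probabilities for two hypothetical alcohol levels.
print(predict(fit, newdata = data.frame(alcohol = c(9, 13)), type = "probs"))
```

The response has to be an ordered factor for `polr()`, which is why `ordered_result = TRUE` is set when creating the rating.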

Multivariate Analysis

The multivariate plots confirmed the relations found in the bivariate analysis.
The relationship between alcohol percentage and quality stands out: the variable
alcohol can be used as a predictor for the quality of wine. Also, the plot
for density versus residual sugar clearly shows that the higher quality wines
are on the lower side of the scattercloud, meaning that higher quality wines
tend to have a lower density at a given level of residual sugar.


Final Plots and Summary

Histogram of chlorides

During univariate analysis, this type of plot gave a quick overview of the
distributions of all numeric variables in the dataset. It showed the presence of
extreme outliers, as can be seen in the distribution for chlorides. By choosing
a very small binwidth we are able to see the very long right tail of the
distribution.

##      75% 
## 21.14286

In fact, the maximum value is 21.14 times the interquartile range higher than
the value of the 3rd quartile, where 3 times the interquartile range above the
3rd quartile can already be seen as an extreme outlier (according to Tukey).
The variable chlorides is an extreme example, but there are more variables in
the dataset for which limits were required in order to make relations better
visible.
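The figure of 21.14 can be reproduced from the quartiles and the maximum of chlorides reported in the summary table earlier in this report. The sketch below just works through that arithmetic; the variable names are my own.

```r
# Quartiles and maximum of chlorides, taken from the summary table in this report.
q1 <- 0.036
q3 <- 0.050
mx <- 0.346

iqr   <- q3 - q1           # interquartile range
ratio <- (mx - q3) / iqr   # distance of the maximum above Q3, in IQR units
print(ratio)

# Tukey's outer fence: values above Q3 + 3 * IQR count as extreme outliers.
outer_fence <- q3 + 3 * iqr
print(mx > outer_fence)
```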

Density plot for alcohol grouped by quality

## 
##  Spearman's rank correlation rho
## 
## data:  ww$alcohol and ww$quality
## S = 1.096e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.4403692

This plot is lifted out of the ggpairs plot in the bivariate section and shows
the relation between alcohol and quality. Please note that “density” in this
plot does not refer to the variable density but to the frequency density.

This plot clearly shows that the distributions for the poor and average wines
differ from the distribution for good wines. Also, the mean alcohol percentage
for good wines is higher than for the other two groups.

Scatterplot for residual sugar and density, grouped by quality

By choosing to use the quality bins instead of all levels of the variable quality
we can make the poor and good quality wines stand out. The image confirms
that density and residual sugar have a strong correlation. Also it is clear
that higher quality wines tend to be on the lower side of the cloud, meaning
that high quality wines tend to have a lower density. This was confirmed by
the (rather weak) correlation coefficient of -0.35 for density and quality.

Summary

It is now time to answer the questions stated in the first part of the report:

  • The main question: are there physicochemical variables in this dataset that
    are good predictors of the quality of Vinho Verde white wine, and if yes,
    which variables are they?

There is only one variable (alcohol) that has a moderately strong positive
relationship with the quality of white wine. This variable can be used to
predict wine quality. With the exception of variables citric acid and free
sulfur dioxide, all other relations are weak but significant.

  • According to literature, acids greatly contribute to the quality of wine. Is
    there a relationship between acidity and wine quality?

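The correlation test for fixed acidity showed a significant but very weak
negative relationship with quality (rho = -0.08). For citric acid, the
relationship was not significant on the full dataset (p = 0.20); it became
significant, though still very weak (rho = 0.09), only after leaving out
extreme values.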

  • When the level of volatile acids is too high, it negatively affects taste.
    Is there a negative relation between volatile acidity and quality?

Volatile acidity is considered a fault when it exceeds 1.2 g/dm3. Indeed, none
of the observations in the dataset exceeds this value (the maximum is 1.1).
However, a comparison of the means of the quality groups shows that poor quality
wines have on average a higher level of volatile acidity (0.38, versus 0.27 for
average and 0.28 for good wines).

  • According to wine making sources, the lower the pH, the higher the acidity is
    and vice versa. Can we see this (negative) relation in this dataset as well?

This relationship is demonstrated most clearly for the variable fixed acidity
and pH. Indeed, there is a moderately strong negative relation.

  • The optimal alcohol percentage for Vinho Verde wines is between 8% and 11.5%.
    Can we see that wines with an alcohol percentage outside this range have lower
    quality ratings?

Plot 2 in the summary above shows rather the contrary: higher quality wines
tend to have higher levels of alcohol, with an average alcohol level for good
quality wine of more than 11.5%.

  • The taste of wine is impacted by sulfur dioxide levels higher than 50 ppm. Is
    there a negative relation between total sulfur dioxide level and quality?

There is a negative relation between total sulfur dioxide and quality, but
it is fairly weak with a correlation coefficient of -0.2.


Reflection

The first and arguably most important lesson that I learned when I started out
exploring this dataset about wine is that although in theory it is possible to
explore and visualize a well-documented dataset, in practice you will always
need some domain knowledge as a data analyst.

Second, courses often use datasets that are known to have strong relationships
between their variables. This dataset taught me that this is not always the
case and that weak but significant relations between variables are also
relations.

This was just a first exploration. Further work can be done by using ordered
logistic regression to predict the quality of white wine. Once we have a model,
we could see whether it also works for other types of white wine. There is also
a dataset available for the red variant of Vinho Verde wines, so we could
compare the two datasets.